Protein one-dimensional (1D) structures such as secondary structure and contact number provide
intuitive pictures to understand how the native three-dimensional (3D) structure of a protein
is encoded in the amino acid sequence. However, it has not been clear whether a given set of
1D structures contains sufficient information for recovering the underlying 3D structure. Here
we show that the 3D structure of a protein can be recovered from a set of three types of 1D
structures, namely, secondary structure, contact number and residue-wise contact order, the
last of which is introduced here for the first time. Using simulated annealing molecular dynamics
simulations, structures satisfying the given native 1D structural restraints were sought for 16
proteins of various structural classes and of sizes ranging from 56 to 146 residues. When the
structures best satisfying the restraints were selected, all the proteins showed a coordinate RMS
deviation of less than 4 Å from the native structure, and for most of them, the deviation was even
less than 2 Å. The present result opens a new possibility for protein structure prediction and for
our understanding of the sequence-structure relationship.



I. INTRODUCTION 

Deciphering how the three-dimensional (3D) structure of a protein is encoded into the
corresponding amino acid sequence is a fundamental step toward understanding a wide spectrum
of complex biological phenomena. One approach to this problem is to develop a method for
structure prediction, and to interpret the encoding scheme in terms of model parameters and
optimization algorithms. However, de novo or ab initio methods for 3D structure prediction are
often too complicated to clarify the relation between sequence and structure.

One-dimensional (1D) structure prediction (Rost, 2003) is a more intuitive route to
understanding the sequence-structure relationship. 1D structures are 3D structural features
projected onto strings of residue-wise structural assignments (Rost, 2003), which include
secondary structures (SS), solvent accessibility and contact numbers (CN). Although 1D
structures can show intuitive correspondence between amino acid sequence and protein
structure, it has not been known whether a given set of 1D structures is sufficient for uniquely
specifying the underlying 3D structure. Clearly, SS alone cannot specify the 3D structure of a
globular protein. Using SS and/or other 1D structures such as CN, is it possible at all to
recover the native structure? The recent remarkable result by Porto et al. (2004) suggests that
the answer is affirmative. They have shown that the principal eigenvector of the contact map of
a protein is essentially equivalent to the contact map itself (Porto et al., 2004). Using the
correct contact map, we can safely recover the native 3D structure (Vendruscolo et al., 1997).
However, when the principal eigenvector is to be used for reconstructing the contact map using
the algorithm by Porto et al. (2004), the following strict conditions must be met. First, the
principal eigenvector must be extremely accurate. Second, very strict definitions for
residue-residue contact (such as those based on an all-atom representation) must be used.
Third, the target protein must be compact and consist of a single domain. Failing any one of
these conditions results in a combinatorial explosion. It should also be noted that, although
the principal eigenvector shows a significant correlation with the contact number vector, it is
difficult to interpret its geometrical meaning. Therefore, it is desirable to find 1D structures
which are more robust and easier to understand, but still sufficient for the reconstruction of the
native 3D structure.

Kabakçıoğlu et al. (2002) have shown that the number of 3D structures that satisfy the native
CN is limited. The contact number $n_i$ of the $i$-th residue is defined as
$n_i = \sum_j C_{i,j}$, where $C_{i,j}$ is the contact map of the native structure of a protein.
That is, $C_{i,j} = 1$ if the residues $i$ and $j$ are in contact, and $C_{i,j} = 0$ otherwise.
In our preliminary study, we constructed many 3D structures that satisfy the native SS and CN
for a small all-α protein, and found that a few percent of the structures were highly native-like
(Kinjo et al., 2005), supporting the result by Kabakçıoğlu et al. (2002). However, we have also
found that it is difficult to recover the native structures of larger proteins or those with
complex topologies using only SS and CN restraints. Therefore, either some very powerful
optimization techniques or other types of 1D structures seemed necessary.

In this paper, we introduce a new kind of 1D structure called residue-wise contact order
(RWCO), and show that, given the native SS, CN and RWCO, it is possible to recover the native
3D structures of proteins of various
FIG. 1 An example of contact number (CN) and residue-wise contact order (RWCO). The MolScript
(Kraulis, 1991) drawing in the upper panel shows the native fold of Protein G (2gb1); the bottom
panel shows the corresponding CN (solid line, left ordinate) and RWCO (dashed line, right
ordinate).



topologies. The contact order was originally introduced to quantify the complexity of the native
topology of proteins to investigate the correlation between the native structure and its folding
rate (Plaxco et al., 1998). As such, the contact order is a per-protein quantity. Here, we extend
the definition of the contact order to make it a per-residue quantity. Using the same notation as
the definition of CN, the residue-wise contact order $o_i$ of the $i$-th residue is defined by
$o_i = \sum_j |i - j| C_{i,j}$. That is, the RWCO of a residue is expressed as the sum of sequence
separations of contacting residues. An example of CN and RWCO is shown in Figure 1. We can see
that CN and RWCO exhibit similar trends, but the value of RWCO is larger for the residues making
long-range contacts (e.g., the N- and C-terminal strands in Figure 1) and smaller for those making
short-range contacts (e.g., the central α helix in Figure 1). Like SS and CN, RWCO has a clear
geometrical meaning, and the combination of the three types of 1D structures is expected to be
more tolerant of small perturbations in the reconstruction of 3D structures.
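
As an illustration of the two definitions, the following minimal Python sketch computes CN and RWCO from a binary contact map. The function names and the toy five-residue contact map are assumptions made for this example only; they are not taken from the present study.

```python
from typing import List, Sequence


def contact_numbers(cmap: Sequence[Sequence[int]]) -> List[int]:
    """n_i = sum_j C_ij: number of residues in contact with residue i."""
    return [sum(row) for row in cmap]


def residue_wise_contact_orders(cmap: Sequence[Sequence[int]]) -> List[int]:
    """o_i = sum_j |i - j| C_ij: summed sequence separation of the contacts of residue i."""
    return [
        sum(abs(i - j) * c for j, c in enumerate(row))
        for i, row in enumerate(cmap)
    ]


# Hypothetical symmetric contact map for a five-residue chain (illustration only).
toy_map = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]

cn = contact_numbers(toy_map)                # [2, 1, 0, 1, 2]
rwco = residue_wise_contact_orders(toy_map)  # [7, 3, 0, 3, 7]
```

In this toy map, residues 0 and 4 each make two contacts, both with residues far along the chain, so their RWCO (7) is large relative to their CN (2).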



II. MATERIALS AND METHOD 

For searching 3D structures that satisfy the given 1D structural restraints, we use simulated
annealing molecular dynamics simulations. In the present paper, two residues are defined to be
in contact if the distance between the Cβ atoms (or Cα atoms in the case of glycines) is less
than 12 Å. This rather generous cut-off distance has been shown to maximize the correlation
between predicted and observed contact numbers (Kinjo et al., 2005). To exclude trivial
nearest-neighbor contacts, we set $C_{i,j} = 0$ if $|i - j| < 3$. To make CN and RWCO
differentiable with respect to atomic coordinates, we slightly modify the definition of
residue-residue contact by using a sigmoid function of inter-residue distance:
$C_{i,j} = 1/\{1 + \exp[w(r_{i,j} - 12)]\}$, where $r_{i,j}$ is the distance between Cβ atoms of
residues $i$ and $j$ (Kinjo et al., 2005). The parameter $w$ determines the sharpness of the
sigmoid function, and was set to 3 in this paper. We used the EMBOSS distance geometry program
(Nakai et al., 1993) with default parameters and modifications for the CN and RWCO restraint
functions. We use an all-atom representation of proteins derived from the AMBER force field
(Weiner et al., 1986). The force field used is the so-called distance geometry force field, in
which all the energy terms are expressed as penalty functions, including bond lengths, bond
angles (1-3 distance), torsion angles (1-4 distance), short-range (1-4) and long-range (1-5) soft
repulsions (no attractions), together with chiral center and chiral volume restraints (Nakai
et al., 1993). Therefore, if a structure perfectly satisfies the ideal peptide geometry and all
the restraints, the energy should take its minimum value of zero. Disulphide bonds, if any, were
ignored, and no ligands or co-factors were taken into account.
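
The smoothed contact definition above can be sketched in a few lines of Python. This is only an illustrative sketch, not the modified EMBOSS code used in the study: the function names are hypothetical, and the input is assumed to be one representative coordinate per residue (Cβ, or Cα for glycine). The derivative helper uses the standard sigmoid identity dC/dr = -w C (1 - C), which is what makes CN and RWCO usable as differentiable restraints.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]

CUTOFF = 12.0    # contact distance in angstroms, as in the text
SHARPNESS = 3.0  # sigmoid parameter w, as in the text
MIN_SEP = 3      # pairs with |i - j| < 3 are excluded, as in the text


def smooth_contact(r: float, w: float = SHARPNESS, d0: float = CUTOFF) -> float:
    """Sigmoid contact C = 1 / (1 + exp[w (r - d0)]), written to avoid overflow."""
    x = w * (r - d0)
    if x >= 0.0:
        e = math.exp(-x)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(x))


def smooth_contact_derivative(r: float, w: float = SHARPNESS, d0: float = CUTOFF) -> float:
    """dC/dr = -w C (1 - C), obtained by differentiating the sigmoid."""
    c = smooth_contact(r, w, d0)
    return -w * c * (1.0 - c)


def smooth_cn_rwco(coords: Sequence[Coord]) -> Tuple[List[float], List[float]]:
    """Return (CN, RWCO) per residue from one representative atom per residue."""
    n = len(coords)
    cn = [0.0] * n
    rwco = [0.0] * n
    for i in range(n):
        for j in range(i + MIN_SEP, n):
            c = smooth_contact(math.dist(coords[i], coords[j]))
            cn[i] += c
            cn[j] += c
            rwco[i] += (j - i) * c
            rwco[j] += (j - i) * c
    return cn, rwco
```

With these definitions a pair at exactly 12 Å contributes 0.5 to the contact number of each residue, and the contribution decays smoothly to zero at larger distances.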

Secondary structures were assigned by the DSSP program (Kabsch & Sander, 1983). For α helices,
distance restraints were imposed on hydrogen-bonding pairs, and dihedral angle restraints were
imposed on φ and ψ angles. For β strands, distance restraints were imposed between Cα atoms
within each strand segment, and loose dihedral angle restraints for φ and ψ angles were also
included.

Given a set of native contact numbers $\{\hat{n}_i\}$, the CN restraints were imposed as
$w_n \sum_i (n_i - \hat{n}_i)^2$, where $w_n$ is a weight factor which was set to 5. Similarly,
with the native residue-wise contact orders $\{\hat{o}_i\}$, the RWCO restraints were imposed as
$w_o \sum_i (o_i - \hat{o}_i)^2$, with the weight factor $w_o$ set to 0.5 divided by the sequence
length.
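
The two harmonic restraint terms can be written down directly. The sketch below is a plain-Python illustration under the weights stated above (5 for CN, 0.5 divided by the sequence length for RWCO); the function names are hypothetical and the code is not the restraint implementation used in the modified EMBOSS program.

```python
from typing import Sequence

W_CN = 5.0          # CN weight factor, as stated in the text
W_RWCO_BASE = 0.5   # RWCO weight is this value divided by the sequence length


def _sum_sq_diff(values: Sequence[float], targets: Sequence[float]) -> float:
    """Sum of squared deviations between current and native values."""
    if len(values) != len(targets):
        raise ValueError("current and native vectors must have the same length")
    return sum((v - t) ** 2 for v, t in zip(values, targets))


def cn_penalty(cn: Sequence[float], native_cn: Sequence[float]) -> float:
    """CN restraint energy: w_n * sum_i (n_i - native n_i)^2."""
    return W_CN * _sum_sq_diff(cn, native_cn)


def rwco_penalty(rwco: Sequence[float], native_rwco: Sequence[float]) -> float:
    """RWCO restraint energy: (0.5 / L) * sum_i (o_i - native o_i)^2."""
    weight = W_RWCO_BASE / len(native_rwco)
    return weight * _sum_sq_diff(rwco, native_rwco)
```

Both terms vanish when the current values equal the native ones, consistent with the zero-energy minimum of the penalty-based force field. Scaling the RWCO weight by the sequence length is one way to keep the term comparable across proteins of different sizes, since RWCO values grow with chain length.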

To construct a structure, we first generated a random coil which was minimized by 500 steps of
the conjugate gradient method. Then a canonical molecular dynamics simulation at a temperature
of 1000 K was performed for 10000 steps, after which the system was cooled by 2 K per 100 steps
until the temperature was 100 K. Then, the system was further cooled by 1 K per 100 steps down
to 10 K. The molecular dynamics simulations were performed in four-dimensional space to relax
the multiple minima problem (Havel, 1991; Nakai et al., 1993). Finally, conjugate gradient
minimization was applied for 2000 steps to recover the structure in three-dimensional space.
This procedure was iterated 300 times with different initial random coils to yield 300
independent structures for each target protein. We sorted these structures in increasing order
of their total energy to select the best 100 structures.

TABLE I Summary of 3D structures recovered from 1D structures.^a

                          #/RMSD range^c          minimum     minimum
Protein (length)^b     [0,2)   [2,4)   [4,6)      energy^d    RMSD^e

^a Out of 300 generated structures, the 100 lowest-energy structures were selected for the
statistics.
^b PDB identifier with sequence length in parentheses.
^c Number of structures falling in the given range of RMSD (Å) from the native structure. The
notation "[x, y)" indicates an RMSD greater than or equal to x Å and less than y Å.
^d RMSD (Å) of the lowest-energy structure, with the energy value (no physical unit) in
parentheses.
^e The minimum RMSD (Å), with the energy value (no physical unit) in parentheses.
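
The cooling schedule of the annealing protocol described in this section can be made explicit with a short sketch. It assumes one particular reading of the text, namely that each 2 K (or 1 K) decrement is followed by 100 MD steps at the new temperature; the function name is hypothetical and the code only enumerates the temperature stages, it does not run any dynamics.

```python
from typing import Iterator, Tuple


def annealing_schedule() -> Iterator[Tuple[float, int]]:
    """Yield (temperature in K, number of MD steps) for each stage of one run."""
    yield 1000.0, 10000      # initial high-temperature phase
    t = 1000
    while t > 100:           # cool by 2 K per 100 steps down to 100 K
        t -= 2
        yield float(t), 100
    while t > 10:            # cool by 1 K per 100 steps down to 10 K
        t -= 1
        yield float(t), 100


stages = list(annealing_schedule())
total_steps = sum(steps for _, steps in stages)
```

Under this reading, one run comprises 64000 MD steps (10000 at 1000 K, 45000 during the 2 K cooling phase and 9000 during the 1 K cooling phase), preceded by 500 and followed by 2000 conjugate gradient steps.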

As target proteins, we chose from the Protein Data Bank (Berman et al., 2000) four all-α, four
all-β, five α+β, and three α/β proteins whose sequence lengths range from 56 to 146 residues
(Table I, first column). These structures were arbitrarily selected, but so as to include
proteins of varying structural classes and sizes.



III. RESULTS AND DISCUSSION 

For 14 out of the 16 target proteins, we obtained reconstructed structures whose Cα root mean
square deviations (RMSD) from the native structure are less than 2 Å (Table I, second to fourth
columns). Many of them exhibit even less than 1 Å RMSD. For two other targets, namely 2pcy
(plastocyanin) and 135l (turkey egg white lysozyme), we still find structures with less than
3.5 Å RMSD. By selecting the structures of the lowest energy, we can almost always identify
highly native-like structures (Table I, fifth column). One exception is 2pcy (plastocyanin),
whose "best" structure shows 10.9 Å RMSD. However, this structure is actually the mirror image
of the native structure. When the mirror image transformation is applied to this structure, its
RMSD from the native structure is 1.4 Å. Occurrence of mirror image structures is an inherent
problem of methods which use distance-based restraints (CN and RWCO are based on inter-atomic
distances). Nevertheless, the result for 2pcy suggests that it is also possible to obtain
structures with less than 2 Å RMSD if we generate a sufficiently large number of structures.
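
The mirror-image ambiguity of distance-based restraints can be checked numerically: a reflection leaves every inter-atomic distance (and hence CN and RWCO) unchanged while reversing handedness. The following minimal sketch uses hypothetical helper names and a toy four-point structure; it illustrates the geometric fact only and is not part of the protocol used in this study.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def pairwise_distances(coords: Sequence[Coord]) -> List[float]:
    """All inter-point distances, in a fixed (i < j) order."""
    return [
        math.dist(a, b)
        for i, a in enumerate(coords)
        for b in coords[i + 1:]
    ]


def mirror(coords: Sequence[Coord]) -> List[Coord]:
    """Reflect through the yz-plane (x -> -x), producing the mirror image."""
    return [(-x, y, z) for x, y, z in coords]


def signed_volume(a: Coord, b: Coord, c: Coord, d: Coord) -> float:
    """Triple product (b-a) . ((c-a) x (d-a)); its sign encodes handedness."""
    u = [b[k] - a[k] for k in range(3)]
    v = [c[k] - a[k] for k in range(3)]
    w = [d[k] - a[k] for k in range(3)]
    return (
        u[0] * (v[1] * w[2] - v[2] * w[1])
        - u[1] * (v[0] * w[2] - v[2] * w[0])
        + u[2] * (v[0] * w[1] - v[1] * w[0])
    )


# Toy structure (illustration only): four points forming a right-handed frame.
toy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
mirrored = mirror(toy)
```

For this toy structure the distance lists of `toy` and `mirrored` are identical, whereas the signed volume changes from 1 to -1, so a chirality-sensitive quantity of this kind is needed to tell a structure from its mirror image.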

The minimum RMSDs are shown in the rightmost column of Table I. These structures do not always
correspond to those with the lowest energy. However, since the total energy averaged over all
300 generated structures is greater by one or two orders of magnitude, the energies of most of
the minimum-RMSD structures are close to the lowest energy.

The yield of native-like structures varies greatly depending on the target protein. The native
fold of 1utg (uteroglobin) is a very simple one with four relatively short α helices, and all
the 100 selected structures are within 2 Å RMSD from the native structure. In contrast, only a
handful of native-like structures were obtained for 2pcy (plastocyanin), which has a complex
β sandwich topology. In general, it seems to be more difficult to obtain native-like structures
for proteins with a large number of long-range contacts.

A reason for the relatively low yield of native-like structures is the use of a simple
simulated annealing method for the optimization. Since all the native-like structures with less
than 2 Å RMSD exhibit low energy values, the restraints used are sufficient for specifying the
native-like structures, but many structures are trapped in local minima during optimization. In
fact, we observed that setting a high temperature in the initial phase of simulated annealing
increased the yield of native-like structures. Therefore, the yield is expected to be even
higher if we apply more powerful optimization techniques or improved algorithms.

As can be seen in Figure 1, CN and RWCO are highly correlated with each other. Are they both
required to reconstruct the native structures? When we performed calculations without RWCO but
otherwise followed exactly the same protocol as above, the total number of native-like
structures was much smaller (Table II, values before "/"). We obtained native-like structures
only for small and/or simple proteins such as 1r69, 1utg, 256bA, or 1ctf. The optimized
structures for larger proteins such as 1mba tended to form only relatively short-range contacts.
Furthermore, even if the correct native structures were recovered, it was difficult to
discriminate them by the penalty function. A slightly better, but qualitatively similar result
TABLE II Summary of 3D structures recovered from 1D structures without RWCO (values before "/")
or without CN (values after "/") (cf. Table I).

                    #/RMSD range               minimum          minimum
Protein    [0,2)      [2,4)      [4,6)         energy [Å]       RMSD [Å]
1r69       6 / 15     15 / 11    4 / 15        1.3 / 1.2        1.2 / 0.8
1utg       2 / 23     31 / 56    10 / 3        2.0 / 0.9        1.7 / 0.8
256bA      1 / 14     8 / 3      2 / 0         8.8 / 2.1        1.6 / 1.3
1mba       0 / 0      0 / 4      0 / 3         13.3 / 2.3       10.4 / 2.3
1shg       0 / 0      0 / 2      1 / 6         8.6 / 9.7        4.1 / 2.7
1csp       0 / 0      1 / 2      4 / 2         10.0 / 9.9       2.8 / 2.9
1ten       0 / 0      0 / 0      1 / 0         10.4 / 13.3      5.9 / 8.0
2pcy       0 / 0      0 / 0      0 / 0         13.3 / 13.2      8.2 / 7.6
2gb1       0 / 0      0 / 0      2 / 1         6.9 / 7.5        5.1 / 5.9
1ctf       11 / 21    2 / 6      7 / 6         1.5 / 1.1        1.2 / 0.9
1vcc       0 / 0      0 / 0      3 / 1         10.8 / 12.0      5.0 / 5.3
2acy       0 / 0      0 / 0      1 / 1         12.4 / 13.2      5.7 / 5.4
135l       0 / 0      0 / 0      0 / 0         13.3 / 14.8      10.5 / 8.5
1ay7B      0 / 0      0 / 0      0 / 1         10.2 / 10.2      6.2 / 5.4
1thx       0 / 0      0 / 0      0 / 0         12.4 / 9.1       7.4 / 7.1
3chy       0 / 0      0 / 0      0 / 0         14.9 / 12.0      6.6 / 9.9


was obtained when CN was omitted in the calculations (Table II, values after "/"). In this case,
compared to the case without RWCO, the optimized structures tended to contain a comparable or
smaller number of contacts, but of longer range. From these observations, we conclude that CN
and RWCO contain complementary information required to accurately determine the native-like
structures.

It is of interest to ask whether SS, CN and RWCO uniquely specify the native 3D structure of a
protein (except for the mirror image). We expect that this is the case, although we cannot draw
a definite conclusion from the restraint-based, rather than constraint-based, method used in
this study. All the optimized structures do satisfy the given 1D structural restraints to a
certain extent, but those with high energies tend to contain significant distortions in their
local geometry and large steric overlaps. Thus, given the native SS, CN and RWCO, the number of
structures consistent with these restraints as well as with the ideal peptide chain geometry
should be very limited. It should be noted that this argument probably applies only if the
full-atom representation is used; otherwise there may exist non-native-like structures with low
energy values.

Although we have performed a direct optimization of 3D structures by imposing 1D structural
restraints, it may also be possible to first reconstruct the contact map satisfying the 1D
restraints, and then recover the 3D structure from the contact map. In an initial phase of the
present study, we applied a deterministic depth-first search algorithm similar to that of Porto
et al. (2004). However, this method failed to converge. Since both CN and RWCO are accumulative
quantities, there may not be any strategy to efficiently eliminate unsuccessful candidates in
early stages of the search. Another possibility is applying a Monte Carlo method in contact map
space. We have applied a variant of the multicanonical methods (Wang & Landau, 2001), but failed
to find a solution exactly satisfying the 1D restraints. Nevertheless, for small proteins, the
contact maps thus obtained, which best but not exactly satisfy the restraints, contained at
least 30 to 40% of the correct native contacts, and appeared similar to the native contact map
by visual inspection. Therefore, it may be possible to use such contact maps to construct
starting conformations for further optimizations.

Since the three types of 1D structures, SS, CN and RWCO, are sufficient for determining the
native 3D structure, it is possible to predict the native structure of a protein if we can
accurately predict these 1D structures. Methods for secondary structure prediction are now quite
mature and are already routinely used in de novo 3D structure prediction (Rost, 2003). We have
previously developed a method to predict CN from amino acid sequence to a decent accuracy with a
correlation coefficient of 0.63 (Kinjo et al., 2005). We have recently developed a simple linear
regression method for RWCO prediction which yields a moderate correlation of 0.59 between the
predicted and native RWCOs (Kinjo & Nishikawa, 2005). At present, we do not expect that the
native 3D structure can be obtained by using the predicted 1D structures: 1D predictions of
higher accuracies must be achieved. Nevertheless, if the accuracies of 1D structure prediction
are sufficiently improved, the missing link between amino acid sequence and the native 3D
structure of globular proteins may be completed.